Course: COMPUTATIONAL THINKING FOR GOVERNANCE ANALYTICS

Prof. José Manuel Magallanes, PhD

  • Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
  • Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú.

Dimensionality Reduction in R

As the name implies, we want to reduce a set of several variables into a few variables. In this session, we will practice two basic techniques:

Cluster Analysis

Simply speaking, this technique organizes the cases (rows) into a small set of groups, based on the information (the columns) available for each case. It creates a new variable, which is of categorical type, generally ordinal.
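As a minimal sketch of that idea (using R's built-in mtcars data, not the course data), several numeric columns are reduced to one categorical grouping variable:

```r
# minimal sketch with built-in mtcars data (not the course data):
# several numeric columns become one categorical grouping variable
d  <- dist(scale(mtcars[, c("mpg", "hp", "wt")]))  # distances between cases
hc <- hclust(d, method = "ward.D2")                # hierarchical clustering
mtcars$group <- factor(cutree(hc, k = 3))          # the new categorical variable
table(mtcars$group)                                # cases per group
```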

Let me bring back the data we prepared in Python:

# clean memory
rm(list = ls()) 

# link to data file
link='https://github.com/EvansDataScience/CTforGA_integrating/raw/main/demfragiso_expo.RDS'

# a RDS file from the web needs:
myFile=url(link)

# reading in the data:
fromPy=readRDS(file = myFile)

# reset indexes to R paradigm:
row.names(fromPy)=NULL

# check data types
str(fromPy)
## 'data.frame':    141 obs. of  25 variables:
##  $ Countryname                 : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ Officialstatename           : chr  "The Islamic Republic of Afghanistan" "The Republic of Albania" "The People's Democratic Republic of Algeria" "The Republic of Angola" ...
##  $ InternetccTLD               : chr  ".af" ".al" ".dz" ".ao" ...
##  $ iso2                        : chr  "AF" "AL" "DZ" "AO" ...
##  $ iso3                        : chr  "AFG" "ALB" "DZA" "AGO" ...
##  $ Regimetype                  : Ord.factor w/ 4 levels "Authoritarian"<..: 1 3 1 1 3 2 4 4 1 1 ...
##  $ Overallscore                : num  0.32 6.11 3.77 3.37 6.81 5.49 8.9 8.07 2.68 2.52 ...
##  $ Electoralprocessandpluralism: num  0 7 3.08 1.33 9.17 7.5 10 9.58 0.5 0.42 ...
##  $ Functioningofgovernment     : num  0.07 6.43 2.5 2.86 5 5.71 8.57 6.79 2.5 2.71 ...
##  $ Politicalparticipation      : num  0 4.44 4.44 5 7.22 6.11 7.78 8.89 2.78 3.33 ...
##  $ Politicalculture            : num  1.25 5.63 5 5 5 3.13 8.75 6.88 5 4.38 ...
##  $ Civilliberties              : num  0.29 7.06 3.82 2.65 7.65 5 9.41 8.24 2.65 1.76 ...
##  $ Total                       : num  8.99 4.48 6.01 7.62 3.55 ...
##  $ X1ExternalIntervention      : num  8.3 6.1 3.7 4.6 4.3 7 0.5 0.5 7.2 5.3 ...
##  $ C1SecurityApparatus         : num  10 4.8 6 7.2 4.9 5.7 2.7 1.6 6.4 5.9 ...
##  $ P2PublicServices            : num  9.8 4.4 5.6 9.3 4.8 3.9 2.8 2.3 5.5 3.5 ...
##  $ E3HumanFlightandBrainDrain  : num  7 8.3 5.5 6 3 6.8 0.5 1.6 4.3 3 ...
##  $ E2EconomicInequality        : num  8.1 2.9 5.6 8.9 4.9 3.6 1.8 2.3 5.1 5.3 ...
##  $ S2RefugeesandIDPs           : num  8.8 2.6 6.8 5.9 1.9 6.6 2 4.4 6.9 1.7 ...
##  $ C3GroupGrievance            : num  7.2 4.1 7.2 8.1 3.8 5.3 3.1 3.9 6.1 9.6 ...
##  $ P1StateLegitimacy           : num  8.7 5.5 7.8 8.2 4 6.9 0.5 0.6 9.1 8 ...
##  $ C2FactionalizedElites       : num  8.6 6.2 7.5 7.2 2.8 7 1.7 3.2 7.9 7.6 ...
##  $ S1DemographicPressures      : num  9 4.1 4.8 9 5.3 4.4 2.9 3.4 4.2 4.1 ...
##  $ E1Economy                   : num  9.2 6.4 6.8 8.4 7.1 6.6 1.6 1.8 4.7 4.1 ...
##  $ P3HumanRights               : num  7.4 3.6 6.3 6.2 3.3 6 1.7 0.5 7.7 8.6 ...

I. Prepare data:

We have several countries and several columns. In clustering, we try to create groups so that we have the highest homogeneity within groups and the highest heterogeneity between groups. The variables serve to compute some distance among the cases, which the clustering algorithms use to detect the homogeneity and heterogeneity just mentioned.
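For intuition, the distance between two cases is computed from their columns; with numeric data, the default Euclidean distance is just the square root of the summed squared differences (a toy sketch with made-up numbers):

```r
# toy sketch: distance between two cases described by three variables
x <- rbind(a = c(1, 0, 2),
           b = c(4, 4, 2))
dist(x)                             # Euclidean distance between a and b -> 5
sqrt(sum((x["a", ] - x["b", ])^2))  # the same, computed by hand -> 5
```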

Here you should subset the data: for this case, keep only the columns with numeric values (dropping the categorical columns and the summary scores), and rename the row index with the country names:

# select variables: drop text/id columns (1-5), Regimetype (6),
# Overallscore (7), and Total (13)
dataToCluster=fromPy[,-c(1,2:7,13)]
#save the country names as the row index
row.names(dataToCluster)=fromPy$Countryname

II. Compute the DISTANCE MATRIX:

  1. Set random seed: Multivariate techniques may use some random processes; managing that randomness allows replication of results.
set.seed(999) 
  2. Decide distance method and compute distance matrix: We use gower, as it allows for multiple data types. The default, euclidean, works well when all the data are numeric.
library(cluster)
distanceMatrix=daisy(x=dataToCluster, metric = "gower")
  3. Represent distances: Once you have the distanceMatrix, you can use it to locate each case in a simpler space; in this case, let’s use a bidimensional representation. The function cmdscale can produce a two-dimension map of points using distanceMatrix:
projectedData = cmdscale(distanceMatrix, k=2)

The object projectedData is saving coordinates for each element in the data:

# save coordinates to original data frame:
fromPy$dim1 = projectedData[,1]
fromPy$dim2 = projectedData[,2]

# see some:

fromPy[,c('dim1','dim2')][1:10,]
##           dim1         dim2
## 1   0.40766940  0.002254166
## 2  -0.06708838 -0.024849688
## 3   0.10496448  0.048467522
## 4   0.21713312 -0.019916926
## 5  -0.12747131 -0.025454222
## 6   0.02232614 -0.016713408
## 7  -0.38544488  0.015037246
## 8  -0.32517748 -0.003159269
## 9   0.13881726  0.121031763
## 10  0.08468431  0.202137453
  4. Use those points to see a “map” of the cases:
library(ggplot2)
base= ggplot(data=fromPy,
             aes(x=dim1, y=dim2,
                 label=Countryname)) 
base + geom_text(size=2)

Can you see some groups emerging?

An alternative is the dendrogram:

# prepare hierarchical cluster
hc = hclust(distanceMatrix)
# very simple dendrogram
plot(hc,hang = -1,cex=0.5)

Compute Clusters

There are several techniques for clustering, each with pros and cons. We will review hierarchical clustering here. As the name implies, it builds a full hierarchy of all the possible clusterings, so it is up to us to identify how many clusters emerge; the dendrogram helps identify how many clusters can be found. The hierarchical approach has two different strategies. The agglomerative approach starts by considering each case (row) a cluster, and then, using the distances, links the individual cases; not all cases are linked during the first step, the second step creates more clusters, and so on. The linking process can also vary (we will use the ward linkage). The divisive approach, on the other hand, starts by considering all the cases as one cluster, and then divides them.
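The two strategies can be sketched with the functions agnes (agglomerative) and diana (divisive) from the cluster package, here on R's built-in USArrests data rather than the course data:

```r
library(cluster)                 # agnes() and diana() live here

d  <- dist(scale(USArrests))     # distances on built-in USArrests data
ag <- agnes(d, method = "ward")  # bottom-up: start with n clusters, merge
di <- diana(d)                   # top-down: start with 1 cluster, split
# both produce a full hierarchy; we decide where to cut it
table(cutree(as.hclust(ag), k = 4))
table(cutree(as.hclust(di), k = 4))
```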

1. Compute clustering suggestions

Using the function fviz_nbclust from the library factoextra, we can see how many clusters are suggested.

  1. For hierarchical (agglomerative):
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(dataToCluster, 
             hcut,
             diss=distanceMatrix,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "agnes")

  2. For hierarchical (divisive):
fviz_nbclust(dataToCluster, 
             hcut,
             diss=distanceMatrix,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "diana")

We could accept the number of clusters suggested or not. Let’s use the suggestion:

2. Compute clusters:

Here you can compute the clusters. I am using both the agglomerative and the divisive approaches. Notice that you have to indicate the number of clusters required a priori.

NumberOfClusterDesired=4


#library(factoextra)
res.agnes= hcut(distanceMatrix, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='agnes',
                hc_method = "ward.D2")

# Hierarchical technique- divisive approach
res.diana= hcut(distanceMatrix, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='diana',
                hc_method = "ward.D2")

3. Results from Clustering

3.1 Save results to original data frame:

fromPy$agn=as.factor(res.agnes$cluster)
fromPy$dia=as.factor(res.diana$cluster)

3.2 Verify ordinality in clusters: Each cluster has a label. If the result represents some ordering, the labels do not necessarily reflect it. Below, I use the column Overallscore to get an idea of the real ordering that the labels represent.

aggregate(data=fromPy,
          Overallscore~agn,
          FUN=mean)
##   agn Overallscore
## 1   1     3.232683
## 2   2     6.193333
## 3   3     8.196061
## 4   4     2.575625
aggregate(data=fromPy,
          Overallscore~dia,
          FUN=mean)
##   dia Overallscore
## 1   1     2.900652
## 2   2     7.014783
## 3   3     5.346818
## 4   4     8.216071

You could recode these values so that the labels represent an ascending order.

library(dplyr)

fromPy$agn=dplyr::recode_factor(fromPy$agn, 
                  `1` = '2',`2`='3',`3`='4',`4`='1')
fromPy$dia=dplyr::recode_factor(fromPy$dia, 
                  `1` = '1',`2`='3',`3`='2',`4`='4')

4. Evaluate Results

The hierarchical clustering process returns the silhouette information for each observation, a measure of how well a case has been classified. Silhouette widths vary from -1 to +1: the higher the positive value, the better classified a case is; low positive values indicate poor clustering, and negative values indicate wrongly clustered cases.
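A minimal sketch of the silhouette computation, using cluster::silhouette on built-in data (not the course data):

```r
library(cluster)

d   <- dist(scale(USArrests))                       # built-in data
cl  <- cutree(hclust(d, method = "ward.D2"), k = 4) # cluster labels
sil <- silhouette(cl, d)                            # one width per case
head(sil[, "sil_width"])                            # individual widths in [-1, 1]
mean(sil[, "sil_width"])                            # overall average width
```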

4.1 Plot silhouettes

fviz_silhouette(res.agnes)
##   cluster size ave.sil.width
## 1       1   41          0.21
## 2       2   51          0.18
## 3       3   33          0.51
## 4       4   16          0.35

library(factoextra)
fviz_silhouette(res.diana)
##   cluster size ave.sil.width
## 1       1   46          0.24
## 2       2   23          0.23
## 3       3   44          0.13
## 4       4   28          0.31

4.2 Detecting cases wrongly clustered

  1. Save individual silhouettes:

Previous results have saved important information. Let me keep the cases with negative silhouette values:

agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)

agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])

Now, I can see what countries are not well clustered:

library("qpcR") 
bad_Clus=as.data.frame(qpcR:::cbind.na(sort(agnPoor),
                                       sort(diaPoor)))
names(bad_Clus)=c("agn","dia")
bad_Clus
##             agn             dia
## 1       Algeria         Czechia
## 2    Bangladesh           Ghana
## 3         Benin          Guyana
## 4         Chile          Israel
## 5         Egypt         Liberia
## 6         Gabon         Namibia
## 7        Jordan North Macedonia
## 8    Kyrgyzstan           Qatar
## 9       Liberia          Serbia
## 10      Morocco    Sierra Leone
## 11 Sierra Leone           Spain
## 12    Sri Lanka          Zambia
## 13       Turkey            <NA>

How to compare clustering?

  1. Color the maps:

    • For Hierarchical AGNES:
base= ggplot(data=fromPy,
             aes(x=dim1, y=dim2,
                 label=Countryname)) 
agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
                                              aes(color=agn),
                                              show.legend = T) 
    • For Hierarchical DIANA:
diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
                                              aes(color=dia),
                                              show.legend = T) 
  2. Compare visually:
library(ggpubr)

ggarrange(agnPlot, diaPlot,ncol = 2,common.legend = T)

  3. Annotating outliers:

    • Prepare labels:
# If the country name is in the poorly-clustered list, use it; otherwise leave blank
LABELdia=ifelse(fromPy$Countryname%in%diaPoor,fromPy$Countryname,"")
LABELagn=ifelse(fromPy$Countryname%in%agnPoor,fromPy$Countryname,"")
    • Prepare plot with the outlying labels:
library(ggrepel)
diaPlot=diaPlot + 
        geom_text_repel(aes(label=LABELdia),
                        max.overlaps=50,
                        min.segment.length =unit(0,'lines'))
agnPlot= agnPlot + 
         geom_text_repel(aes(label=LABELagn),
                         max.overlaps = 50,
                         min.segment.length = unit(0, 'lines'))
    • Plot and compare:
ggarrange(agnPlot, 
          diaPlot,
          ncol = 2,
          common.legend = T)

  4. Produce Dendrograms for the results:
fviz_dend(res.agnes,
          k=NumberOfClusterDesired, 
          cex = 0.45, 
          horiz = T,
          main = "AGNES approach")

fviz_dend(res.diana,
          k=NumberOfClusterDesired, 
          cex = 0.45, 
          horiz = T,
          main = "DIANA approach")

Let’s compare these clusters with the levels proposed by The Economist:

table(fromPy$Regimetype,fromPy$agn)
##                   
##                     2  3  4  1
##   Authoritarian    31  0  0 16
##   Hybrid regime     9 20  0  0
##   Flawed democracy  1 31 15  0
##   Full democracy    0  0 18  0
table(fromPy$Regimetype,fromPy$dia)
##                   
##                     1  3  2  4
##   Authoritarian    38  0  8  1
##   Hybrid regime     8  0 21  0
##   Flawed democracy  0 22 15 10
##   Full democracy    0  1  0 17

This way, you can use the clustering results to validate other classifications done theoretically or by simpler techniques (e.g., averaging).
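One simple way to quantify the agreement shown in cross-tables like the ones above is cluster “purity”: the share of cases falling in each row’s dominant cell (a toy sketch with made-up labels, not the course data):

```r
# toy sketch: agreement between an external classification and a clustering
external <- factor(c("low", "low", "high", "high", "low", "high"))
clusters <- factor(c(1, 1, 2, 2, 1, 1))
tab <- table(external, clusters)
tab
sum(apply(tab, 1, max)) / sum(tab)  # purity: share in each row's dominant cell
```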

FACTOR ANALYSIS

Simply speaking, this technique tries to express in one (or a few) dimension(s) the behavior of several others. FA assumes that the several input variables have ‘something’ in common: there is something latent that the set of input variables represents.
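The idea of a latent variable can be sketched with simulated data: three noisy indicators share one underlying factor, and factanal (from base R's stats package, a simpler cousin of the fa function we will use below) recovers it:

```r
# toy sketch: three indicators driven by one simulated latent variable
set.seed(1)
latent <- rnorm(200)
obs <- data.frame(v1 = latent + rnorm(200, sd = 0.5),
                  v2 = latent + rnorm(200, sd = 0.5),
                  v3 = latent + rnorm(200, sd = 0.5))
fit <- factanal(obs, factors = 1, scores = "regression")
fit$loadings                   # all three indicators load on the one factor
cor(fit$scores[, 1], latent)   # the recovered factor tracks the latent one
```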

Let’s follow these steps:

1. Subset our original data frame:

Our dataForFA has the same data we used for clustering.

dataForFA=dataToCluster

We know that we have two sets of variables, one related to democracy and other to fragility, all of them computed at the country level.

2. Compute the correlations:

In Factor Analysis we need a measure of similarity among variables (not cases). Let me compute the heterogeneous correlation matrix (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables).

library(polycor)
corMatrix=polycor::hetcor(dataForFA)$correlations

We can visualize this matrix:

library(ggcorrplot)

ggcorrplot(corMatrix,
           type = "lower") + 
          theme(axis.text.x  = element_text(size = 5),
                axis.text.y  = element_text(size = 5))

You should notice that the variables belonging to the same concept are correlated among one another, while variables from different concepts should have a low correlation.

3. Check Conditions

3.1 The amount of data should be enough for the correlation process:

library(psych)
psych::KMO(corMatrix) 
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA =  0.93
## MSA for each item = 
## Electoralprocessandpluralism      Functioningofgovernment 
##                         0.90                         0.95 
##       Politicalparticipation             Politicalculture 
##                         0.93                         0.95 
##               Civilliberties       X1ExternalIntervention 
##                         0.91                         0.93 
##          C1SecurityApparatus             P2PublicServices 
##                         0.95                         0.92 
##   E3HumanFlightandBrainDrain         E2EconomicInequality 
##                         0.95                         0.94 
##            S2RefugeesandIDPs             C3GroupGrievance 
##                         0.94                         0.93 
##            P1StateLegitimacy        C2FactionalizedElites 
##                         0.93                         0.94 
##       S1DemographicPressures                    E1Economy 
##                         0.93                         0.96 
##                P3HumanRights 
##                         0.93

3.2 The correlation matrix should not be an identity matrix:

# is identity matrix?
cortest.bartlett(corMatrix,n=nrow(dataForFA))$p.value>0.05
## [1] FALSE

3.3 The correlation matrix should not be singular

library(matrixcalc)

is.singular.matrix(corMatrix)
## [1] TRUE

If some conditions fail, you should not expect a reliable result; however, you may continue in order to see the sources of the flaws.

4. Get suggestions for the number of factors

fa.parallel(dataForFA, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

5. Compute the Factors

library(GPArotation)
resfa <- fa(dataForFA,
            nfactors = 2,
            cor = 'mixed',
            rotate = "varimax",
            fm="minres")

6. Explore results

### see results
print(resfa$loadings)
## 
## Loadings:
##                              MR1    MR2   
## Electoralprocessandpluralism -0.225 -0.872
## Functioningofgovernment      -0.516 -0.733
## Politicalparticipation       -0.303 -0.753
## Politicalculture             -0.434 -0.557
## Civilliberties               -0.282 -0.933
## X1ExternalIntervention        0.679  0.454
## C1SecurityApparatus           0.731  0.467
## P2PublicServices              0.893  0.330
## E3HumanFlightandBrainDrain    0.769  0.265
## E2EconomicInequality          0.796  0.378
## S2RefugeesandIDPs             0.671  0.362
## C3GroupGrievance              0.409  0.485
## P1StateLegitimacy             0.471  0.825
## C2FactionalizedElites         0.561  0.672
## S1DemographicPressures        0.831  0.322
## E1Economy                     0.800  0.290
## P3HumanRights                 0.478  0.787
## 
##                  MR1   MR2
## SS loadings    6.405 6.099
## Proportion Var 0.377 0.359
## Cumulative Var 0.377 0.736

You can see better using a cutoff:

print(resfa$loadings,cutoff = 0.5)
## 
## Loadings:
##                              MR1    MR2   
## Electoralprocessandpluralism        -0.872
## Functioningofgovernment      -0.516 -0.733
## Politicalparticipation              -0.753
## Politicalculture                    -0.557
## Civilliberties                      -0.933
## X1ExternalIntervention        0.679       
## C1SecurityApparatus           0.731       
## P2PublicServices              0.893       
## E3HumanFlightandBrainDrain    0.769       
## E2EconomicInequality          0.796       
## S2RefugeesandIDPs             0.671       
## C3GroupGrievance                          
## P1StateLegitimacy                    0.825
## C2FactionalizedElites         0.561  0.672
## S1DemographicPressures        0.831       
## E1Economy                     0.800       
## P3HumanRights                        0.787
## 
##                  MR1   MR2
## SS loadings    6.405 6.099
## Proportion Var 0.377 0.359
## Cumulative Var 0.377 0.736

The previous results serve to indicate whether variables group in a meaningful way. In our example, you want to know if the indicators in each set are grouped together. These results can also be visualized:

fa.diagram(resfa,main = "EFA results")

7. Make a decision

  • Can we use these results? Maybe not.
  • Can we improve these results? You can try to eliminate variables, guided by the interpretation of the results.

8. Improve the Factor Analysis

8.1 Let’s remove some variables. These belong to ‘fragility’ but may be close to ‘democracy’:

ps=c("P1StateLegitimacy","P2PublicServices","P3HumanRights")
notPs=setdiff(names(dataForFA),ps)
dataForFA2=dataForFA[,notPs]

8.2 Recompute correlations

# recompute the heterogeneous correlations:
library(polycor)
corMatrix2=polycor::hetcor(dataForFA2)$correlations

8.3 Recheck conditions:

  1. KMO
library(psych)
psych::KMO(corMatrix2) 
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix2)
## Overall MSA =  0.92
## MSA for each item = 
## Electoralprocessandpluralism      Functioningofgovernment 
##                         0.86                         0.96 
##       Politicalparticipation             Politicalculture 
##                         0.93                         0.94 
##               Civilliberties       X1ExternalIntervention 
##                         0.88                         0.93 
##          C1SecurityApparatus   E3HumanFlightandBrainDrain 
##                         0.96                         0.94 
##         E2EconomicInequality            S2RefugeesandIDPs 
##                         0.90                         0.92 
##             C3GroupGrievance        C2FactionalizedElites 
##                         0.90                         0.95 
##       S1DemographicPressures                    E1Economy 
##                         0.88                         0.94
  2. Bartlett
cortest.bartlett(corMatrix2,n=nrow(dataForFA2))$p.value>0.05
## [1] FALSE
  3. Singularity
library(matrixcalc)

is.singular.matrix(corMatrix2)
## [1] FALSE

8.4. Get suggestions

fa.parallel(dataForFA2, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

8.5 Compute factors (requesting three this time):

library(GPArotation)
resfa <- fa(dataForFA2,
            nfactors = 3,
            cor = 'mixed',
            rotate = "varimax",
            fm="minres")

8.6 Explore results

Visually:

fa.diagram(resfa,main = "EFA results (2)")

If you find no theory to explain this result, you can try a smaller number of factors (try 2):

library(GPArotation)
resfa <- fa(dataForFA2,
            nfactors = 2,
            cor = 'mixed',
            rotate = "varimax",
            fm="minres")

fa.diagram(resfa,main = "EFA results (3)")

8.7 Analyse new results

  • Communalities: how much of each variable’s variance is accounted for by the factors.
sort(resfa$communality)
##             C3GroupGrievance             Politicalculture 
##                    0.3999367                    0.4983043 
##            S2RefugeesandIDPs       Politicalparticipation 
##                    0.6022094                    0.6673252 
##   E3HumanFlightandBrainDrain       X1ExternalIntervention 
##                    0.6675217                    0.6942566 
##         E2EconomicInequality                    E1Economy 
##                    0.7238521                    0.7296954 
##       S1DemographicPressures          C1SecurityApparatus 
##                    0.7409543                    0.7523090 
##        C2FactionalizedElites      Functioningofgovernment 
##                    0.7566783                    0.8130100 
## Electoralprocessandpluralism               Civilliberties 
##                    0.8543870                    0.9745699
  • Complexity: to how many factors a variable is related (values near 1 mean the variable loads mainly on one factor):
sort(resfa$complexity)
## Electoralprocessandpluralism   E3HumanFlightandBrainDrain 
##                     1.141434                     1.177320 
##                    E1Economy               Civilliberties 
##                     1.196322                     1.221408 
##       S1DemographicPressures       Politicalparticipation 
##                     1.292111                     1.353628 
##            S2RefugeesandIDPs         E2EconomicInequality 
##                     1.424945                     1.442187 
##          C1SecurityApparatus       X1ExternalIntervention 
##                     1.540870                     1.589541 
##      Functioningofgovernment             Politicalculture 
##                     1.869219                     1.987251 
##             C3GroupGrievance        C2FactionalizedElites 
##                     1.990284                     1.997956

It is not that we have the perfect solution, but you eventually need to stop.

9. Compute the scores

Each factor is represented by a score:

summary(resfa$scores)
##       MR1               MR2         
##  Min.   :-2.2798   Min.   :-2.3848  
##  1st Qu.:-0.7765   1st Qu.:-0.8269  
##  Median : 0.1364   Median : 0.2563  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7811   3rd Qu.: 0.8296  
##  Max.   : 1.7894   Max.   : 1.4822

As you see, the scores are not in the range from zero to ten; let’s rescale them:

library(BBmisc)
efa_scores=normalize(resfa$scores, 
                       method = "range", 
                       margin=2, # by column
                       range = c(0, 10))
# save the scores in the data
fromPy$Fragile_efa=efa_scores[,1]
fromPy$Demo_efa=efa_scores[,2]
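If BBmisc is not available, the same min-max rescaling to the 0-10 range can be done in base R (a sketch with made-up scores):

```r
# base-R equivalent of the min-max rescaling above
rescale0_10 <- function(x) 10 * (x - min(x)) / (max(x) - min(x))

scores <- cbind(f1 = c(-2, 0, 1, 3),    # made-up factor scores
                f2 = c(0.5, 1, 2, 4))
apply(scores, 2, rescale0_10)           # each column now runs from 0 to 10
```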

You do not normally have a means to validate the scores, but our example data has pre-computed scores. Let’s use those to see if our scores make sense.

  • Democracy:
library("ggpubr")
ggscatter(data=fromPy, x = "Overallscore", y = "Demo_efa", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "DemoIndex (original)", ylab = "DemoIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'

  • Fragility
ggscatter(data=fromPy, x = "Total", y = "Fragile_efa", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "FragileIndex (original)", ylab = "FragileIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'

Let’s save the data frame fromPy for further use:

saveRDS(fromPy,file = 'fromPyPlus.RDS')